Let’s start with the data on median household income and median age by county in the state of California from the 2016-2020 ACS.
library(tidycensus)ca_wide <-get_acs(geography ="county",state ="CA",variables =c(medinc ="B19013_001", # median household incomemedage ="B01002_001"# median age ),output ="wide",year =2020) %>%print()
# A tibble: 58 × 6
GEOID NAME medincE medincM medageE medageM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 06001 Alameda County, California 104888 1041 37.8 0.2
2 06003 Alpine County, California 85750 23875 47.6 9.5
3 06007 Butte County, California 54972 2501 36.9 0.2
4 06011 Colusa County, California 59427 5384 35.4 0.3
5 06013 Contra Costa County, California 103997 1239 39.9 0.1
6 06017 El Dorado County, California 83710 2228 46.3 0.2
7 06019 Fresno County, California 57109 929 32.4 0.2
8 06023 Humboldt County, California 49235 2278 38.9 0.4
9 06025 Imperial County, California 46222 2499 32.5 0.3
10 06029 Kern County, California 54851 1087 31.9 0.2
# … with 48 more rows
Our first ggplot figure is a histogram of median income:
library(tidyverse)# avoid using scientific notationoptions(scipen =999)ggplot(ca_wide) +geom_histogram(aes(x = medincE))
Exercise. Display the documentation for geom_histogram().
Exercise. By default, geom_histogram uses 30 bins. Change the bins argument to other values.
Exercise. Display the histogram of the median ages.
Boxplot of the median household income:
ggplot(ca_wide) +geom_boxplot(aes(y = medincE))
Exercise. Display the boxplot of the median ages.
Plot the relationship between two variables by a scatterplot:
ggplot(ca_wide) +geom_point(aes(x = medageE, y = medincE))
Overlay with a linear model fit. Note that both geom_point() and geom_smooth layers inherit the mapping aes(x = medageE, y = medincE).
ggplot(ca_wide, aes(x = medageE, y = medincE)) +geom_point() +geom_smooth(method ="lm")
Overlay with a non-linear model fit:
ggplot(ca_wide, aes(x = medageE, y = medincE)) +geom_point() +geom_smooth(method ="loess")
3 Customizing ggplot visualization
Suppose we want to illustrate the percent of commuters that take public transportation to work for the largest metropolitan areas in the United States, using data from the 2019 1-year ACS.
library(tidycensus)library(tidyverse)metros <-get_acs(geography ="cbsa", # metropolitan statistical areavariables ="DP03_0021P", # percentage of public transportation commuterssummary_var ="B01003_001", # total populationsurvey ="acs1",year =2019) %>%# 10 most populous CBSAsslice_max(summary_est, n =20) %>%print()
# A tibble: 20 × 7
GEOID NAME varia…¹ estim…² moe summa…³ summa…⁴
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 35620 New York-Newark-Jersey City, NY-… DP03_0… 31.6 0.2 1.92e7 NA
2 31080 Los Angeles-Long Beach-Anaheim, … DP03_0… 4.8 0.1 1.32e7 NA
3 16980 Chicago-Naperville-Elgin, IL-IN-… DP03_0… 12.4 0.3 9.46e6 1469
4 19100 Dallas-Fort Worth-Arlington, TX … DP03_0… 1.3 0.1 7.57e6 NA
5 26420 Houston-The Woodlands-Sugar Land… DP03_0… 2 0.2 7.07e6 NA
6 47900 Washington-Arlington-Alexandria,… DP03_0… 13.1 0.4 6.28e6 2482
7 33100 Miami-Fort Lauderdale-Pompano Be… DP03_0… 2.9 0.2 6.17e6 NA
8 37980 Philadelphia-Camden-Wilmington, … DP03_0… 9.4 0.3 6.10e6 NA
9 12060 Atlanta-Sandy Springs-Alpharetta… DP03_0… 2.8 0.2 6.02e6 3340
10 38060 Phoenix-Mesa-Chandler, AZ Metro … DP03_0… 1.8 0.2 4.95e6 NA
11 14460 Boston-Cambridge-Newton, MA-NH M… DP03_0… 13.4 0.4 4.87e6 NA
12 41860 San Francisco-Oakland-Berkeley, … DP03_0… 18.9 0.4 4.73e6 NA
13 40140 Riverside-San Bernardino-Ontario… DP03_0… 1.1 0.2 4.65e6 NA
14 19820 Detroit-Warren-Dearborn, MI Metr… DP03_0… 1.4 0.2 4.32e6 NA
15 42660 Seattle-Tacoma-Bellevue, WA Metr… DP03_0… 10.7 0.4 3.98e6 NA
16 33460 Minneapolis-St. Paul-Bloomington… DP03_0… 4.5 0.2 3.64e6 NA
17 41740 San Diego-Chula Vista-Carlsbad, … DP03_0… 2.8 0.3 3.34e6 NA
18 45300 Tampa-St. Petersburg-Clearwater,… DP03_0… 1.1 0.2 3.19e6 NA
19 19740 Denver-Aurora-Lakewood, CO Metro… DP03_0… 4.5 0.3 2.97e6 NA
20 41180 St. Louis, MO-IL Metro Area DP03_0… 1.8 0.2 2.80e6 1618
# … with abbreviated variable names ¹variable, ²estimate, ³summary_est,
# ⁴summary_moe
Bar chart:
ggplot(metros) +geom_col(aes(x = NAME, y = estimate))
Shorten metro names, reorder by the percentage (estimat), rotate the bar chart, and more meaningful title and axis labels:
metros %>%mutate(NAME =str_remove(NAME, "-.*$")) %>%mutate(NAME =str_remove(NAME, ",.*$")) %>%ggplot(aes(y =reorder(NAME, estimate), x = estimate)) +geom_col() +theme_minimal() +labs(title ="Public transit commute share", subtitle ="2019 1-year ACS estimates", y ="", x ="ACS estimate", caption ="Source: ACS Data Profile variable DP03_0021P")
More styles:
library(scales)metros %>%mutate(NAME =str_remove(NAME, "-.*$")) %>%mutate(NAME =str_remove(NAME, ",.*$")) %>%ggplot(aes(y =reorder(NAME, estimate), x = estimate)) +geom_col(color ="navy", fill ="navy", alpha =0.5, width =0.85) +theme_minimal(base_size =12, base_family ="Verdana") +scale_x_continuous(labels =label_percent(scale =1)) +labs(title ="Public transit commute share", subtitle ="2019 1-year ACS estimates", y ="", x ="ACS estimate", caption ="Source: ACS Data Profile variable DP03_0021P")
4 Visualizing margins of error
Let’s visualize the median household incomes of counties in the state of California from the 2016-2020 ACS.
California counties and population sizes:
ca <-get_decennial(state ="CA",geography ="county",variables =c(totalpop ="P1_001N"),year =2020 ) %>%arrange(desc(value)) %>%print()
# A tibble: 58 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 06037 Los Angeles County, California totalpop 10014009
2 06073 San Diego County, California totalpop 3298634
3 06059 Orange County, California totalpop 3186989
4 06065 Riverside County, California totalpop 2418185
5 06071 San Bernardino County, California totalpop 2181654
6 06085 Santa Clara County, California totalpop 1936259
7 06001 Alameda County, California totalpop 1682353
8 06067 Sacramento County, California totalpop 1585055
9 06013 Contra Costa County, California totalpop 1165927
10 06019 Fresno County, California totalpop 1008654
# … with 48 more rows
# A tibble: 58 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06001 Alameda hhincome 104888 1041
2 06003 Alpine hhincome 85750 23875
3 06005 Amador hhincome 65187 4225
4 06007 Butte hhincome 54972 2501
5 06009 Calaveras hhincome 67054 4535
6 06011 Colusa hhincome 59427 5384
7 06013 Contra Costa hhincome 103997 1239
8 06015 Del Norte hhincome 49981 5145
9 06017 El Dorado hhincome 83710 2228
10 06019 Fresno hhincome 57109 929
# … with 48 more rows
Basic plot:
ggplot(ca_income, aes(x = estimate, y =reorder(NAME, estimate))) +geom_point(size =1.5, color ="darkgreen") +labs(title ="Median household income", subtitle ="Counties in California", x ="", y ="ACS estimate") +theme_minimal(base_size =9) +scale_x_continuous(labels =label_dollar()) +scale_y_discrete(guide =guide_axis(n.dodge =2))
Add error bar (from margin of errors):
ggplot(ca_income, aes(x = estimate, y =reorder(NAME, estimate))) +geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +geom_point(size =1.5, color ="darkgreen") +theme_minimal(base_size =8) +labs(title ="Median household income", subtitle ="Counties in California", x ="2016-2020 ACS estimate", y ="") +scale_x_continuous(labels =label_dollar()) +scale_y_discrete(guide =guide_axis(n.dodge =2))
What do we observe?
ca %>%arrange(value)
# A tibble: 58 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 06003 Alpine County, California totalpop 1204
2 06091 Sierra County, California totalpop 3236
3 06049 Modoc County, California totalpop 8700
4 06051 Mono County, California totalpop 13195
5 06105 Trinity County, California totalpop 16112
6 06043 Mariposa County, California totalpop 17131
7 06027 Inyo County, California totalpop 19016
8 06063 Plumas County, California totalpop 19790
9 06011 Colusa County, California totalpop 21839
10 06015 Del Norte County, California totalpop 27743
# … with 48 more rows
5 Visualize ACS estimates over time
Let’s obtain 1-year ACS data from 2005 through 2021 (no 2020 ACS data) on median home value for Deschutes County, Oregon, home to the city of Bend and large numbers of in-migrants in recent years from the Bay Area in California.
years <-c(2005:2019, 2021)names(years) <- yearsdeschutes_value <-map_dfr(years, ~{get_acs(geography ="county",variables ="B25077_001", # median home valuestate ="OR",county ="Deschutes",year = .x,survey ="acs1" )}, .id ="year") %>%print()
# A tibble: 16 × 6
year GEOID NAME variable estimate moe
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 2005 41017 Deschutes County, Oregon B25077_001 236100 13444
2 2006 41017 Deschutes County, Oregon B25077_001 336600 11101
3 2007 41017 Deschutes County, Oregon B25077_001 356700 16765
4 2008 41017 Deschutes County, Oregon B25077_001 331600 17104
5 2009 41017 Deschutes County, Oregon B25077_001 284300 12652
6 2010 41017 Deschutes County, Oregon B25077_001 260700 18197
7 2011 41017 Deschutes County, Oregon B25077_001 216200 18065
8 2012 41017 Deschutes County, Oregon B25077_001 235300 19016
9 2013 41017 Deschutes County, Oregon B25077_001 240000 16955
10 2014 41017 Deschutes County, Oregon B25077_001 257200 18488
11 2015 41017 Deschutes County, Oregon B25077_001 312600 18751
12 2016 41017 Deschutes County, Oregon B25077_001 317500 17479
13 2017 41017 Deschutes County, Oregon B25077_001 368600 16477
14 2018 41017 Deschutes County, Oregon B25077_001 408500 16808
15 2019 41017 Deschutes County, Oregon B25077_001 413500 19506
16 2021 41017 Deschutes County, Oregon B25077_001 543900 30518
A stylized plot:
ggplot(deschutes_value, aes(x = year, y = estimate, group =1)) +geom_ribbon(aes(ymax = estimate + moe, ymin = estimate - moe),fill ="navy",alpha =0.4 ) +geom_line(color ="navy") +geom_point(color ="navy", size =2) +theme_minimal(base_size =12) +scale_y_continuous(labels =label_dollar(scale = .001, suffix ="k")) +labs(title ="Median home value in Deschutes County, OR",x ="Year",y ="ACS estimate",caption ="Shaded area represents margin of error around the ACS estimate" )
6 Exploring age and sex structure with population pyramids
We use data from the Population Estimates API for the state of California:
# A tibble: 96 × 5
GEOID NAME value SEX AGEGROUP
<chr> <chr> <dbl> <chr> <fct>
1 06 California 39512223 Both sexes All ages
2 06 California 2383716 Both sexes Age 0 to 4 years
3 06 California 2478850 Both sexes Age 5 to 9 years
4 06 California 2529137 Both sexes Age 10 to 14 years
5 06 California 2533694 Both sexes Age 15 to 19 years
6 06 California 2647279 Both sexes Age 20 to 24 years
7 06 California 3094978 Both sexes Age 25 to 29 years
8 06 California 2957974 Both sexes Age 30 to 34 years
9 06 California 2780574 Both sexes Age 35 to 39 years
10 06 California 2501526 Both sexes Age 40 to 44 years
# … with 86 more rows
Remove rows for Both sexes and flip values of Male to negative.
ca_filtered <- ca %>%filter(str_detect(AGEGROUP, "^Age"), SEX !="Both sexes" ) %>%mutate(value =ifelse(SEX =="Male", -value, value)) %>%print()
# A tibble: 36 × 5
GEOID NAME value SEX AGEGROUP
<chr> <chr> <dbl> <chr> <fct>
1 06 California -1220085 Male Age 0 to 4 years
2 06 California -1266874 Male Age 5 to 9 years
3 06 California -1292852 Male Age 10 to 14 years
4 06 California -1292178 Male Age 15 to 19 years
5 06 California -1359751 Male Age 20 to 24 years
6 06 California -1600681 Male Age 25 to 29 years
7 06 California -1525571 Male Age 30 to 34 years
8 06 California -1416586 Male Age 35 to 39 years
9 06 California -1255993 Male Age 40 to 44 years
10 06 California -1252341 Male Age 45 to 49 years
# … with 26 more rows
Pyramid plot:
ggplot(ca_filtered) +geom_col(aes(x = value, y = AGEGROUP, fill = SEX))
A more stylized pyramid:
ca_pyramid <-ggplot(ca_filtered, aes(x = value, y = AGEGROUP, fill = SEX)) +geom_col(width =0.95, alpha =0.75) +theme_minimal(base_family ="Verdana", base_size =12) +scale_x_continuous(labels =~number_format(scale = .001, suffix ="k")(abs(.x)),limits =1000000*c(-1.75, 1.75) ) +scale_y_discrete(labels =~str_remove_all(.x, "Age\\s|\\syears")) +scale_fill_manual(values =c("darkred", "navy")) +labs(x ="", y ="2019 Census Bureau population estimate", title ="Population structure in California", fill ="", caption ="Data source: US Census Bureau population estimates & tidycensus R package")ca_pyramid
With aid of the geofacet package, we can visualize pyramids of multiple states:
# A tibble: 1,872 × 6
GEOID NAME value SEX AGEGROUP prop
<chr> <chr> <dbl> <chr> <fct> <dbl>
1 28 Mississippi 93187 Male Age 0 to 4 years -0.0313
2 28 Mississippi 96334 Male Age 5 to 9 years -0.0324
3 28 Mississippi 105273 Male Age 10 to 14 years -0.0354
4 28 Mississippi 102782 Male Age 15 to 19 years -0.0345
5 28 Mississippi 102207 Male Age 20 to 24 years -0.0343
6 28 Mississippi 104027 Male Age 25 to 29 years -0.0350
7 28 Mississippi 91553 Male Age 30 to 34 years -0.0308
8 28 Mississippi 92312 Male Age 35 to 39 years -0.0310
9 28 Mississippi 85188 Male Age 40 to 44 years -0.0286
10 28 Mississippi 87453 Male Age 45 to 49 years -0.0294
# … with 1,862 more rows
ggplot(us_pyramid_data, aes(x = prop, y = AGEGROUP, fill = SEX)) +geom_col(width =1) +theme_minimal() +scale_fill_manual(values =c("darkred", "navy")) +facet_geo(~NAME, grid ="us_state_with_DC_PR_grid2",label ="code") +theme(axis.text =element_blank(),strip.text.x =element_text(size =8)) +labs(x ="", y ="", title ="Population structure by age and sex", fill ="", caption ="Data source: US Census Bureau population estimates & tidycensus R package")
# A tibble: 4,188 × 7
# Groups: county [6]
GEOID tract county state varia…¹ estim…² moe
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 06025010101 Census Tract 101.01 Imperial County Califo… B25077… NA NA
2 06025010102 Census Tract 101.02 Imperial County Califo… B25077… 144000 11653
3 06025010200 Census Tract 102 Imperial County Califo… B25077… 157800 21388
4 06025010300 Census Tract 103 Imperial County Califo… B25077… 295300 52583
5 06025010401 Census Tract 104.01 Imperial County Califo… B25077… 138300 61660
6 06025010402 Census Tract 104.02 Imperial County Califo… B25077… 147200 23108
7 06025010500 Census Tract 105 Imperial County Califo… B25077… 230100 14814
8 06025010600 Census Tract 106 Imperial County Califo… B25077… 257000 23904
9 06025010700 Census Tract 107 Imperial County Califo… B25077… 163100 25943
10 06025010800 Census Tract 108 Imperial County Califo… B25077… 285100 32973
# … with 4,178 more rows, and abbreviated variable names ¹variable, ²estimate
Summary statistics:
housing_val %>%summarize(min =min(estimate, na.rm =TRUE), mean =mean(estimate, na.rm =TRUE), median =median(estimate, na.rm =TRUE), max =max(estimate, na.rm =TRUE) ) %>%arrange(desc(median)) %>%print()
# A tibble: 6 × 5
county min mean median max
<chr> <dbl> <dbl> <dbl> <dbl>
1 Orange County 59600 744923. 658600 2000001
2 Santa Barbara County 197600 735344. 638050 2000001
3 Los Angeles County 45200 674727. 576800 2000001
4 Ventura County 129400 610242. 573400 2000001
5 San Diego County 9999 631103. 561000 2000001
6 Imperial County 17300 189126. 189600 302300
Density plots:
ggplot(housing_val, aes(x = estimate, fill = county)) +geom_density(alpha =0.3)
Choose a different variable in the ACS and/or a different location and create a margin of error visualization of your own.
Modify the population pyramid code to create a different, customized population pyramid. You can choose a different location (state or county), different colors/plot design, or some combination!